library(tidyverse)
library(SingleCellExperiment)Horizontal Data Integration
To start, we will load the tidyverse and SingleCellExperiment packages:
Example data
We will use a popular dataset by Kang et al. (2018) for this tutorial. The dataset measured the effect of interferon-\(\beta\) stimulation on blood cells from eight patients. The muscData package provides an easy way to access the data as a SingleCellExperiment. In case downloading the data with the muscData package fails, you can also directly download the file from http://oc.embl.de/index.php/s/tpbYcH5P9NfXeM5 and load it into R using sce <- readRDS("~/Downloads/PATH_TO_THE_FILE").
sce <- muscData::Kang18_8vs8()see ?muscData and browseVignettes('muscData') for documentation
loading from cache
sceclass: SingleCellExperiment
dim: 35635 29065
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
You can find the number of genes and cells when printing the summary of the sce. Alternatively, you can use nrow(sce), ncol(sce), or dim(sce) to gt these values.
In a SingleCellExperiment object, the meta information about each cell using colData(sce) and for each gene with rowData(sce).
Follow-up question: why does the documentation for colData (run ?colData in the R console) talk about SummarizedExperiment objects instead of SingleCellExperiment?
We log-transform the data to account for the heteroskedasticity of the counts, perform PCA to reduce the dimensions, and run UMAP for visualization. For the preprocessing, we will use the scater package, which adds a new assay called "logcounts" and two reducedDims(sce) called "PCA" and "UMAP" to the SummarizedExperiment object. Equivalent preprocessing functions also exist in Seurat or scanpy.
sce <- scater::logNormCounts(sce)
hvg <- order(MatrixGenerics::rowVars(logcounts(sce)), decreasing = TRUE)
sce <- sce[hvg[1:500], ]
sce <- scater::runPCA(sce, ncomponents = 50)
sce <- scater::runUMAP(sce, dimred = "PCA")The transformGamPoi package from Bioconductor provides a residual_transform function.
assay(sce, "pearson_residuals") <- transformGamPoi::residual_transform(sce, residual = "pearson", on_disk = FALSE)Too few dimensions for PCA mean that it cannot capture enough of the relevant variation in the data. This leads to a loss of subtle differences between cell states.
Too many dimensions for PCA can also pose a problem. PCA smoothes out the Poisson noise uncorrelated that is orthogonal to the biological signal. If too many PCA components are included, the additional dimensions capture the noise that can obscure relevant differences. For more details see Fig. 2d in C. Ahlmann-Eltze and Huber (2023).
The scater package also provides a runTSNE function
sce <- scater::runTSNE(sce, dimred = "PCA")On the one hand, tSNE and UMAP are defacto standards for visualizing the cell states in single cell data analysis. On the other hand, they are often criticized for failing to represent global structure and distorting differences in the data (just look at twitter). A number of alternatives have been suggested:
- Force directed layout of the \(k\) nearest neighbor (kNN) graph (igraph),
- PHATE
- IVIS
- Unification of tSNE and UMAP using contrastive learning (Böhm, 2023)
To visualize the data, we use ggplot2. This is often more verbose than calling an existing function (e.g., scater::plotReducedDim(sce, "UMAP", color_by = "stim")) but has the advantage that the plots are easier to customize.
In the UMAP, the cells separate by treatment status ("stim") and cell type ("cell"). The goal of this tutorial is to understand how different models adjust for the known treatment status and integrate the data into a shared embedding.
as_tibble(colData(sce)) |>
mutate(umap = reducedDim(sce, "UMAP")) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()This dataset already comes with cell type annotations. The cell type labels are helpful for interpreting the results; however, for the purpose of this tutorial, we will not need them and will ignore them from now on.
as_tibble(colData(sce)) |>
mutate(umap = reducedDim(sce, "UMAP")) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = cell), size = 0.3) +
ggrepel::geom_text_repel(data = \(dat) dat |> summarize(umap = matrix(colMedians(umap), nrow = 1), .by = c(stim, cell)),
aes(label = cell)) +
coord_fixed()Warning: Removed 2 rows containing missing values (`geom_text_repel()`).
The following code is based on Seurat’s Guided Clustering Tutorial.
# For more information about the conversion see `?as.Seurat.CellDataSet`
seur_obj <- Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj <- Seurat::NormalizeData(seur_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj <- Seurat::FindVariableFeatures(seur_obj, selection.method = "vst", nfeatures = 500)
# Subset to highly variable genes for memory efficiency
seur_obj <- seur_obj[Seurat::VariableFeatures(object = seur_obj),]
seur_obj <- Seurat::ScaleData(seur_obj)
seur_obj <- Seurat::RunPCA(seur_obj, verbose = FALSE)
seur_obj <-Seurat::RunUMAP(seur_obj, dims = 1:10)Seurat::DimPlot(seur_obj, reduction = "umap", group.by = "stim")Data integration
Figure 1 shows that the data separates by the treatment status. For many downstream analyses, it would be good to know how the cells from the stimulated condition are related to the cells from the control condition. For example for cell type assignment, we might want to annotate both conditions together and ignore the effect of the treatment. This process is called integration.
The goal is a low-dimensional embedding of the cells where the treatment status does not affect the embedding and all residual variance comes from different cell states. Figure 3 shows a successfully integrated example.
There are many methods for single-cell data integration, and Luecken et al. (2022) benchmarked several approaches. Here, I will present four integration methods that are easy to use from R and cover a useful set of features:
- Manual projection
- Automated integration
- Harmony
- MNN
- Invertible integration
- LEMUR
A 2D embedding like UMAP or tSNE gives us a first impression if cells from different conditions are mixed, but the results are not quantitative which means we cannot directly compare the outcome.
One simple metric to measure integration success is to see how mixed the conditions are. For example we can count for each cell how many of the nearest neighbors come from the same condition and how many come from the other conditions. For more information see Luecken et al. (2022).
Follow-up challenge: Write a function to calculate the mixing metric.
Follow-up questions: Can you write an integration function, that scores perfectly on the integration metric? Hint: it can be biologically completely useless. What else would you need to measure to protect against such a pathological example.
Manual Projection
In transcriptomic data analysis, each cell is characterized by its gene expression values. In our case, these are the 500 most variable genes. Accordingly, each cell is a point in a 500-dimensional gene expression space. Principal component analysis (PCA) finds a \(P\)-dimensional space with minimal distance to each cell.
To make these concepts more tangible, I created a cartoon shown in Figure Figure 4. The orange blob represents all cells from the control condition in a 3-dimensional gene expression space. The grey rectangle \(R\) is a lower-dimensional subspace covering the shape of the orange blob. The shape of the blue blob (i.e., the treated cells) resembles the orange blob but is slightly offset. To integrate both conditions, we can project each point from the blue blob onto the subspace covering the orange blob.
We can implement this procedure in a few lines of R code:
- We create a matrix for the cells from the control and treated conditions,
- we fit PCA on the control cells,
- we center the data, and finally
- we project the cells from both conditions onto the subspace of the control condition
# Each column from the matrix is the coordinate of a cell in the 500-dimensional space
ctrl_mat <- as.matrix(logcounts(sce)[,sce$stim == "ctrl"])
stim_mat <- as.matrix(logcounts(sce)[,sce$stim == "stim"])
ctrl_centers <- rowMeans(ctrl_mat)
stim_centers <- rowMeans(stim_mat)
# `prcomp` is R's name for PCA and IRLBA is an algorithm to calculate it.
ctrl_pca <- irlba::prcomp_irlba(t(ctrl_mat - ctrl_centers), n = 20)
# With a little bit of linear algebra, we project the points onto the subspace of the control cells
integrated_mat <- matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat[,sce$stim == "ctrl"] <- t(ctrl_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat[,sce$stim == "stim"] <- t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)We check that our implementation works, by looking at the UMAP of the integrated data.
int_mat_umap <- uwot::umap(t(integrated_mat))
as_tibble(colData(sce)) |>
mutate(umap = int_mat_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()The overlap is not perfect, but better than in Figure 1!
"stim" condition?
The projection is orthogonal onto the subspace, which means it matters which condition is chosen as reference.
stim_pca <- irlba::prcomp_irlba(t(stim_mat - stim_centers), n = 20, center = FALSE)
integrated_mat2 <- matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat2[,sce$stim == "ctrl"] <- t(stim_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat2[,sce$stim == "stim"] <- t(stim_pca$rotation) %*% (stim_mat - stim_centers)
int_mat_umap2 <- uwot::umap(t(integrated_mat2))
as_tibble(colData(sce)) |>
mutate(umap = int_mat_umap2) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()For this example, using the "stim" condition as the reference leads to a worse integration.
The projection approach consists of three steps:
- Centering the data (e.g.,
ctrl_mat - ctrl_centers). - Choosing a reference condition and calculating the subspace that approximates the data from the reference condition (
irlba::prcomp_irlba(t(stim_mat - stim_centers))$rotation). - Projecting the data from the other conditions onto that subspace (
t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)).
For arbitrary experimental designs, we can perform the centering with a linear model fit. The second step remains the same. And after calculating the centered matrix, the third step is also straight forward. Below are the outlines how such a general procedure would work.
# A complex experimental design
lm_fit <- lm(t(logcounts(sce)) ~ condition + batch, data = colData(sce))
# The `residuals` function returns the coordinates minus the mean at the condition.
centered_mat <- t(residuals(lm_fit))
# Assuming that `is_reference_condition` contains a selection of the cells
ref_pca <- irlba::prcomp_irlba(centered_mat[,is_reference_condition], ...)
int_mat <- t(ref_pca$rotation) %*% centered_matAutomatic integration
In the following, I show three approaches for automatically integrating the data. You don’t need to run all three. Usually, you pick the method that you like best. Here, I provide the code for all three so that you can compare for yourself.
Harmony
Harmony is one popular tool for data integration (Korsunsky et al. 2019). Harmony is relatively fast and can handle more complex experimental designs than just a treatment/control setup. It is built around maximum diversity clustering (Figure 6). Unlike classical clustering algorithms, maximum diversity clustering not just minimizes the distance of each data point to a cluster center but also maximizes the mixing of conditions assigned to each cluster.
# Warning: You need `packageVersion("harmony") >= "1.0.0"` for this to work.
harm_mat <- harmony::RunHarmony(reducedDim(sce, "PCA"), colData(sce),
vars_use = c("stim"))Transposing data matrix
Initializing state using k-means centroids initialization
Warning: Quick-TRANSfer stage steps exceeded maximum (= 1453250)
Harmony 1/10
Harmony 2/10
Harmony 3/10
Harmony 4/10
Harmony converged after 4 iterations
harm_umap <- uwot::umap(harm_mat)
as_tibble(colData(sce)) |>
mutate(umap = harm_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()MNN
MNN is short for mutual nearest neighbors and was invented for integrating two conditions by identifying the cells that are mutually nearest neighbors Figure 7. The batchelor package provides an efficient implementation that can also handle experimental designs with more than two conditions.
mnn_sce <- batchelor::fastMNN(sce, batch = sce$stim, BSPARAM=BiocSingular::IrlbaParam())
mnn_umap <- uwot::umap(reducedDim(mnn_sce, "corrected"))
as_tibble(colData(sce)) |>
mutate(umap = mnn_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()Seurat’s integration method is very similar to MNN (however, it calls the mutual nearest neighbors integration anchors). The main difference is that Seurat uses CCA instead of PCA for dimensionality reduction (Butler et al. 2018).
# This code only works with Seurat v5!
seur_obj2 <- Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj2[["originalexp"]] <- split(seur_obj2[["originalexp"]], f = seur_obj2$stim)
seur_obj2 <- Seurat::NormalizeData(seur_obj2, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj2 <- Seurat::FindVariableFeatures(seur_obj2, selection.method = "vst", nfeatures = 500)
seur_obj2 <- Seurat::ScaleData(seur_obj2)
seur_obj2 <- Seurat::RunPCA(seur_obj2, verbose = FALSE)
seur_obj2 <- Seurat::IntegrateLayers(object = seur_obj2, method = Seurat::CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE)
seur_obj2 <- Seurat::RunUMAP(seur_obj2, dims = 1:30, reduction = "integrated.cca")
Seurat::DimPlot(seur_obj2, reduction = "umap", group.by = "stim")Invertible Integration
Tools like MNN and Harmony take a PCA embedding and remove the effects of the specified covariates. However, there is no way to go back from the integrated embedding to the original gene space. This means that we cannot ask the counterfactual what the expression of a cell from the control condition would have been, had it been treated.
A new tool called LEMUR provides this functionality by matching the subspace of each condition (Constantin Ahlmann-Eltze and Huber 2024). Figure 8 illustrates the principle.
LEMUR takes as input a SingleCellExperiment object, the specification of the experimental design, and the number of latent dimensions. To refine the embedding, we will use the provided cell type annotations.
fit <- lemur::lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit <- lemur::align_by_grouping(fit, fit$colData$cell, verbose = FALSE)fit <- lemur::lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit <- lemur::align_harmony(fit, verbose = FALSE)Making a UMAP plot of LEMUR’s embedding shows that it successfully integrated the conditions (Figure 9).
lemur_umap <- uwot::umap(t(fit$embedding))
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()The advantage of the invertible integration is that we can predict what a cell’s expression from the control condition would have been, had it been stimulated and vice versa. Contrasting those predictions tells us how much the gene expression changes for that cell in any gene.
We call LEMUR’s test_de function to compare the expression values in the "stim" and "ctrl" conditions.
fit <- lemur::test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))We can now pick individual genes and plot the predicted log fold change for each cell to show how it varies as a function of the underlying gene expression values (Figure 10).
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
mutate(expr = logcounts(fit)["PLSCR1",]) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = expr), size = 0.3) +
scale_color_viridis_c() +
facet_wrap(vars(stim)) +
coord_fixed()
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
mutate(de = assay(fit, "DE")["PLSCR1",]) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2() +
coord_fixed()nei <- lemur::find_de_neighborhoods(fit, group_by = vars(stim, ind), verbose = FALSE)
as_tibble(nei) %>%
arrange(pval)Session Info
sessionInfo()R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.7.10
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] muscData_1.14.0 ExperimentHub_2.8.0
[3] AnnotationHub_3.8.0 BiocFileCache_2.8.0
[5] dbplyr_2.3.2 SingleCellExperiment_1.22.0
[7] SummarizedExperiment_1.30.2 Biobase_2.60.0
[9] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0
[11] IRanges_2.34.0 S4Vectors_0.38.1
[13] BiocGenerics_0.46.0 MatrixGenerics_1.12.2
[15] matrixStats_1.0.0 lubridate_1.9.2
[17] forcats_1.0.0 stringr_1.5.0
[19] dplyr_1.1.2 purrr_1.0.1
[21] readr_2.1.4 tidyr_1.3.0
[23] tibble_3.2.1 ggplot2_3.4.4
[25] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] rstudioapi_0.14 jsonlite_1.8.8
[3] magrittr_2.0.3 ggbeeswarm_0.7.2
[5] farver_2.1.1 rmarkdown_2.22
[7] zlibbioc_1.46.0 vctrs_0.6.2
[9] memoise_2.0.1 DelayedMatrixStats_1.22.0
[11] RCurl_1.98-1.12 htmltools_0.5.5
[13] S4Arrays_1.0.4 curl_5.0.1
[15] BiocNeighbors_1.18.0 htmlwidgets_1.6.2
[17] cachem_1.0.8 ResidualMatrix_1.10.0
[19] igraph_1.4.3 mime_0.12
[21] lifecycle_1.0.3 pkgconfig_2.0.3
[23] rsvd_1.0.5 Matrix_1.6-4
[25] R6_2.5.1 fastmap_1.1.1
[27] GenomeInfoDbData_1.2.10 shiny_1.7.4
[29] digest_0.6.31 colorspace_2.1-0
[31] AnnotationDbi_1.62.1 scater_1.28.0
[33] irlba_2.3.5.1 RSQLite_2.3.1
[35] beachmat_2.16.0 filelock_1.0.2
[37] labeling_0.4.2 fansi_1.0.4
[39] timechange_0.2.0 httr_1.4.6
[41] compiler_4.3.2 bit64_4.0.5
[43] withr_2.5.0 BiocParallel_1.34.2
[45] viridis_0.6.3 DBI_1.1.3
[47] rappdirs_0.3.3 DelayedArray_0.26.3
[49] lemur_1.0.5 tools_4.3.2
[51] vipor_0.4.5 beeswarm_0.4.0
[53] interactiveDisplayBase_1.38.0 httpuv_1.6.11
[55] glmGamPoi_1.13.3 glue_1.6.2
[57] batchelor_1.16.0 promises_1.2.0.1
[59] grid_4.3.2 generics_0.1.3
[61] gtable_0.3.3 tzdb_0.4.0
[63] hms_1.1.3 BiocSingular_1.16.0
[65] ScaledMatrix_1.8.1 utf8_1.2.3
[67] XVector_0.40.0 RcppAnnoy_0.0.20
[69] ggrepel_0.9.3 BiocVersion_3.17.1
[71] pillar_1.9.0 later_1.3.1
[73] lattice_0.21-9 bit_4.0.5
[75] tidyselect_1.2.0 Biostrings_2.68.1
[77] scuttle_1.10.1 knitr_1.43
[79] gridExtra_2.3 RhpcBLASctl_0.23-42
[81] xfun_0.39 stringi_1.7.12
[83] yaml_2.3.7 evaluate_0.21
[85] codetools_0.2-19 BiocManager_1.30.20
[87] cli_3.6.1 uwot_0.1.14
[89] xtable_1.8-4 munsell_0.5.0
[91] harmony_1.2.0 Rcpp_1.0.10
[93] png_0.1-8 parallel_4.3.2
[95] ellipsis_0.3.2 blob_1.2.4
[97] sparseMatrixStats_1.13.4 bitops_1.0-7
[99] viridisLite_0.4.2 scales_1.2.1
[101] crayon_1.5.2 rlang_1.1.1
[103] cowplot_1.1.1 KEGGREST_1.40.0